Optimized text for full unicode and some escape sequences (#129169) by MitchLewis930 · Pull Request #7 · Signal65/elasticsearch-CodeRabbit

MitchLewis930 · 2026-01-29T05:42:05Z

PR_017

Summary by CodeRabbit

Bug Fixes
- Improved JSON string parsing to correctly handle escape sequences and multi-byte UTF-8 characters, enhancing accuracy in content parsing.
- Optimized string value extraction for improved performance and reliability.
Tests
- Significantly expanded test coverage for JSON string handling with randomized escape sequences, Unicode codepoints, and diverse input scenarios to ensure comprehensive validation.

…9169) Follow-up to elastic#126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters. Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case--removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be more optimized than full String decoding, but it won't be as fast as non-backslashed strings where we can directly reference the input bytes. Relates to elastic#129072.

coderabbitai · 2026-01-29T05:42:16Z

📝 Walkthrough

Walkthrough

The JSON string parsing in ESUTF8StreamJsonParser is refactored from linear while-loop processing to a state machine-based approach using switch statements. This tracks escape sequences and string length separately, reconstructing output by removing escape characters when needed. Tests are rewritten with randomized input generation to validate the new parsing behavior.

Changes

Cohort / File(s)	Summary
JSON Parser State Machine `libs/x-content/impl/src/main/java/.../ESUTF8StreamJsonParser.java`	Replaced linear parsing loop with switch-based state machine handling ASCII characters, escape sequences, UTF-8 multi-byte sequences, and string termination. Added stringLength and backslashes tracking; modified getValueAsText to use computed length and conditionally reconstruct output when escapes are present.
Parser Tests & Validation `libs/x-content/impl/src/test/java/.../ESUTF8StreamJsonParserTests.java`	Rewrote testGetValueRandomized with manual input generation using TestInput record and ESCAPE_SEQUENCES mappings. Added helper methods for random codepoint and input building. Changed assertions to scenario-based logic supporting optimized text paths, replacing prior XContentBuilder approach.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A parser transformed with clever state,
Where backslashes dance, escape sequences wait,
Switch logic bounds through UTF-8 streams,
Tests randomized fulfill their dreams—
Strings now parsed with pristine precision! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change: optimization of text parsing for full Unicode and some escape sequences in the JSON parser.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch pr_017_after

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

MitchLewis930 · 2026-01-29T05:56:50Z

@CodeRabbit full review

coderabbitai · 2026-01-29T05:57:00Z

✅ Actions performed

Full review triggered.

MitchLewis930 · 2026-02-12T16:26:07Z

@coderabbitai full review

coderabbitai · 2026-02-12T16:26:14Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In
`@libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java`:
- Around line 52-57: getValueAsText() currently returns a raw slice from
_inputBuffer when stringEnd > 0 which preserves escape sequences, causing
inconsistency after _finishAndReturnText() rebuilds de-escaped bytes; modify
getValueAsText() to detect when the token contains backslashes (or when
_finishAndReturnText() previously rebuilt bytes) and return the cached
de-escaped bytes instead of the original slice: call or reuse
_finishAndReturnText() (or store its result in a field like cachedTextBytes)
when escapes are present, and in the stringEnd fast path return a Text backed by
those de-escaped bytes (use identifiers _finishAndReturnText, cached buffer
field, stringEnd, _inputBuffer, _inputPtr, stringLength, _tokenIncomplete to
locate code).

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03ba5b1 and 96300a9.

📒 Files selected for processing (2)

libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java
libs/x-content/impl/src/test/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParserTests.java

🔇 Additional comments (7)

libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java (2)

24-31: State tracking additions look good.

The added length tracking and backslash bookkeeping are appropriate for the optimized path.

74-138: Switch-based parsing and buffer rebuild logic look solid.

The state machine and the split between direct-slice and rebuilt-buffer paths are clear and maintain the optimized intent.

libs/x-content/impl/src/test/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParserTests.java (5)

23-23: LGTM.

The Locale import is used by the new escape formatting.

40-95: Test updates for optimized Text path are clear and thorough.

The assertions now explicitly validate the Text-based path across escaped and unescaped cases.

134-145: Good addition of input/result modeling for escapes.

The TestInput record and ESCAPE_SEQUENCES table make test intent explicit.

146-191: Random input generator is well-structured.

The helper methods capture both optimized and fallback paths cleanly.

194-231: Randomized test flow aligns with new optimized behavior.

The branch logic matches the expected contract for getValueAsText() vs getValueAsString().

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

coderabbitai · 2026-02-12T16:35:30Z

...tent/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java

    public Text getValueAsText() throws IOException {
        if (_currToken == JsonToken.VALUE_STRING && _tokenIncomplete) {
            if (stringEnd > 0) {
                final int len = stringEnd - 1 - _inputPtr;
-                // For now, we can use `len` for `stringLength` because we only support ascii-encoded unescaped strings,
-                // which means each character uses exactly 1 byte.
-                return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), len);
+                return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength);
            }


⚠️ Potential issue | 🟠 Major

Ensure repeated getValueAsText() calls return de‑escaped bytes.

When backslashes are present, _finishAndReturnText() returns a rebuilt buffer without escapes, but subsequent calls hit the stringEnd fast path and return a slice of the original input (still containing backslashes). That makes repeated calls inconsistent.

Consider caching the rebuilt bytes (or rebuilding on demand when backslashes exist) and using them in the stringEnd path.

🔧 One way to make repeated calls consistent

@@ - private final List<Integer> backslashes = new ArrayList<>(); + private final List<Integer> backslashes = new ArrayList<>(); + private byte[] stringBytes; @@ - if (stringEnd > 0) { - final int len = stringEnd - 1 - _inputPtr; - return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength); - } + if (stringEnd > 0) { + if (stringBytes != null) { + return new Text(new XContentString.UTF8Bytes(stringBytes), stringLength); + } + final int len = stringEnd - 1 - _inputPtr; + return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength); + } @@ - stringLength = 0; - backslashes.clear(); + stringLength = 0; + backslashes.clear(); + stringBytes = null; @@ - return new Text(new XContentString.UTF8Bytes(buff), stringLength); + stringBytes = buff; + return new Text(new XContentString.UTF8Bytes(buff), stringLength);

🤖 Prompt for AI Agents

In `@libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java` around lines 52 - 57, getValueAsText() currently returns a raw slice from _inputBuffer when stringEnd > 0 which preserves escape sequences, causing inconsistency after _finishAndReturnText() rebuilds de-escaped bytes; modify getValueAsText() to detect when the token contains backslashes (or when _finishAndReturnText() previously rebuilt bytes) and return the cached de-escaped bytes instead of the original slice: call or reuse _finishAndReturnText() (or store its result in a field like cachedTextBytes) when escapes are present, and in the stringEnd fast path return a Text backed by those de-escaped bytes (use identifiers _finishAndReturnText, cached buffer field, stringEnd, _inputBuffer, _inputPtr, stringLength, _tokenIncomplete to locate code).

coderabbitai bot reviewed Feb 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Optimized text for full unicode and some escape sequences (#129169)#7

Optimized text for full unicode and some escape sequences (#129169)#7
MitchLewis930 wants to merge 1 commit intopr_017_beforefrom
pr_017_after

MitchLewis930 commented Jan 29, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Jan 29, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

MitchLewis930 commented Jan 29, 2026

Uh oh!

coderabbitai bot commented Jan 29, 2026

Uh oh!

MitchLewis930 commented Feb 12, 2026

Uh oh!

coderabbitai bot commented Feb 12, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

MitchLewis930 commented Jan 29, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

MitchLewis930 commented Jan 29, 2026

Uh oh!

coderabbitai bot commented Jan 29, 2026

Uh oh!

MitchLewis930 commented Feb 12, 2026

Uh oh!

coderabbitai bot commented Feb 12, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MitchLewis930 commented Jan 29, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 29, 2026 •

edited

Loading